Computational identification of transcription factor binding 
sites by functional analysis of sets of genes sharing overrep- 
resented upstream motifs 

Davide Cora\ Ferdinando Di Cunto^ , Paolo Provero'^ , Lorenzo Silengo^ and Michele Caselle*^ 

^ Dipartimento di Fisica Teorica, Universita di Torino, and INFN, sezione di Torino, Italy 
^ Dipartimento di Genetica, Biologia e Biochimica, Universita di Torino, Torino, Italy 

^ Fondazione per le Biotecnologie, Torino, Italy 

Email: Davide Cora - cora@to.infn.it; Ferdinando Di Cunto - ferdinando.dicunto@unito.it; Paolo Provero - 
paolo.provero@fobiotech.org; Lorenzo Silengo - lorenzo.silengo@unito.it; Michele Caselle*- caselle@to.infn.it; 

'Corresponding author 



Abstract 

Background: Transcriptional regulation is a key mechanism in the functioning of the cell, and is mostly effected 
through transcription factors binding to specific recognition motifs located upstream of the coding region of the 
regulated gene. The computational identification of such motifs is made easier by the fact that they often appear 
several times in the upstream region of the regulated genes, so that the number of occurrences of relevant motifs 
is often significantly larger than expected by pure chance. 

Results: To exploit this fact, we construct sets of genes characterized by the statistical overrepresentation of a 
certain motif in their upstream regions. Then we study the functional characterization of these sets by analyz- 
ing their annotation to Gene Ontology terms. For the sets showing a statistically significant specific functional 
characterization, we conjecture that the upstream motif characterizing the set is a binding site for a transcription 
factor involved in the regulation of the genes in the set. 

Conclusions: The method we propose is able to identify many known binding sites in S. cerevisiae and new 
candidate targets of regulation by known transcritpion factors. Its application to less well studied organisms is 
likely to be valuable in the exploration of their regulatory interaction network. 



Background 

The regulation of gene expression in the eukaryotic cell happens at several different levels, the transcriptional 
one being among the most important. The general mechanism is fairly well understood, and involves the 
interaction between a trans-acting element, usually a protein, and a cis-acting element, a recognition site 
located upstream of the coding region of the regulated gene and consisting in a rather short DNA sequence 
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to which the transcription factor is able to bind. When bound to the cis-acting elements, the trans-acting 
ones interfere with the transcription machinery, and can either enhance or suppress the synthesis of mRNA. 

While many instances of this mechanism have been known in great detail for some time, it is only recently, 
thanks to the availability of several fully sequenced genomes and other experimental data on the scale of the 
entire genome, that a study of transcriptional regulation on a global scale has become possible. Given the 
sheer size of the data, the computational aspects of this analysis are highly non-trivial, and many algorithms 
have been proposed to select the most relevant information and exploit it towards a better understanding of 
the phenomenon. 

One of the most interesting problems is to identify, by purely computational means, candidate cis-acting 
elements, so as to choose promising targets for the experimental investigation and thus greatly enhance 
its effectiveness. At first sight this task might seem prohibitive, since the relevant upstream motifs are 
rather short sequences (in the range of 5 to 20 base pairs) to be found within hundreds or thousands of 
base pairs upstream of the coding region. However, often the relevant motifs must be repeated many times, 
possibly with small variations, in the upstream region for the regulatory action to be effective. This fact 
can be exploited to separate the signal from the noise by searching the upstream region for overrepresented 
motifs, that is motifs appearing many more times than expected by chance on the basis of suitably chosen 
background frequencies. 

This strategy was first suggested in [1] where the following method was devised to identify regulatory 
sites: consider a set of genes experimentally known or presmned to be coregulated (for example because they 
are involved in the same biological process or because they show similar expression profiles in microarray 
experiments) . Then determine which short motifs are overrepresented in their upstream region, compared to 
suitably defined background motif frequencies that take into account the basic features of non-coding DNA 
of the organism under study. These motifs are likely to be involved in the coregulation of the genes in the 
set. 

Recently, a different method was proposed by some of us [2] which, while also based on the statistical 
overrepresentation of regulatory motifs, reverses the procedure compared to [1] and to most other computa- 
tional methods: first, the genes are grouped based on the motifs that are overrepresented in their upstream 
region; then the sets of genes thus obtained are analysed from the point of view of the coregulation of the 
genes they contain, by studying their expression patterns in microarray experiments. The overrepresented 
motifs labelling the sets which do show evidence of coregulation are considered as candidate binding sites 
for the regulation of the genes in the set. The method was tested on yeast {S. cerevisiae) and gave satis- 
factory results, correctly identifying many previously known regulatory elements and proposing a few new 
candidates. 

In this work we examine the sets of genes constructed by the same recipe from a different point of view, 
namely their functional characterization based on their annotation to Gene Ontology [3] terms, the rationale 
being that if a significant fraction of the genes included in the set corresponding to a given motif share 
the same functional characterization, then it is likely that the motif is involved in their coregulation. This 
procedure for validating the sets is likely to be able to recognize different motifs compared to the validation 
through microarray data, which is limited by at least two factors: first, the specific biological processes and 
environmental conditions probed by each microarray experiment, and second, the fact that only rather large 
sets of coregulated genes are likely to produce a statistically significant signal. Therefore it can be hoped 
that this analysis of the sets will be able to complement and improve the one based on expression data. 

Results 

For all possible motifs 5 to 8 nucleotides long we generated the set of genes in whose upstream region the 
motif is overrepresented compared to its prevalence in the upstream region of all genes. A motif occurring n 
times in the upstream region of a gene is considered overrepresented if the probability of having n or more 
occurrences by pure chance, computed as described in the Methods section, is less than 0.01. 
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Given the leniency of the cutoff used to construct the sets, essentially all possible motifs turned out to be 
overrepresented in some genes. It is important to keep in mind that no biological significance is attached to 
the sets of genes thus constructed before they are tested for common functional characterization as described 
below. The average number of genes in each set varied between 11.9 for 7-letter motifs and 41.8 for 5-letter 
motifs. 

The actual number of repetitions of each motif in the upstream sequence needed for a gene to be in- 
cluded in the set greatly varies depending on the length of the motif and of the upstream sequence and the 
background frequency of the motif. For example, 9 repetitions of the motif AAATT are needed for a 500 bp 
upstream sequence, but just one instance of the motif AACCGCGT is enough to make it overrepresented. 
The minimum number of repetitions for the motifs that passed the functional characterization test is re- 
ported in the supplementary material, together with the number of occurrences of a sample motif (ACGCG) 
in the upstream regions of all the genes included in its set. 

Wc then analyzed tlic^ functional characterization of all the sets by studying the prevalence of genes 
annotated to each Gene Ontology term. The statistical significance of such prevalence was evaluated using 
the liypergeometric distribution and the false discovery rate was estimated by comparing the results with 
the ones obtained for randomly generated gene sets. With the false discovery rate set at 0.01, a total of 
107 association were established between 79 different motifs and 41 Gene Ontology terms. These results are 
shown in Tables 1-6. 



Determination of consensus sequences 

The motifs identified by the algorithm clearly form several groups: within each group the motifs are very 
similar to each other and are associated to the same or to very similar Gene Ontology terms. For each group 
of motifs we report a consensus sequence obtained by aligning (by hand) the motifs corresponding to related 
Gene Ontology terms. These consensus sequences are written in terms of the TUPAC code with the rules 
applied in the TRANSFAC database [4]. Note that the intrinsic variability of regulatory binding sites, that 
is apparently neglected by our method since we construct the sets using fixed motifs, reappears naturally in 
the results when the significant motifs are grouped together. 

To verify whether the consensus sequences thus obtained are of biological relevance, we considered for 
each of them how many genes actually contain the consensus sequence in their upstream region, and what 
fraction of these are annotated to the GO terms from which the consensus was obtained. The results 
confirm the correctness of the consensus sequence thus obtained: for example 17 genes contain the sequence 
GATGAGATGAGCT in the upstream region, and 8 of them are annotated "nucleolus" , while the sequence 
MCSMTACAY appears in the upstream region of 78 genes, 20 of which being annotated "cytosolic ribosomc" . 
The complete results are shown in Tab. 7. The only case in which the alignment does not work is the 
sequence AAMYGCGAWCG, which appears in 6 genes, none of which is annotated "nucleosome" . In 
all casc;s except the lattcir one the genes annotatcid to the GO term arc overrepresented among the ones 
containing the consensus sequence in their upstream region, the most significant P-value being 1.13 • 10~^^ 
for genes annotated "ctytosolic ribosome" and containing the consensus sequence MCSCMTACAY. 

Tables 1-6 show also the transcription factors significantly associated to each of the motifs foimd: such 
association is determined, as described in the Materials and Methods section, as a statistically significant 
intersection between the set of genes characterized by overrepresentation of the relevant motif and the set 
of genes identified in Rcf. [5] as able to interact with the transcription factor. The large fraction of motifs 
identified by our method which do have a significant intersection with the sets of genes identified in Ref. [5] 
is proof of the effectiveness of our algorithm, given the completely independent approaches used to define 
our sets of genes and the ones of Ref. [5] . Moreover this analysis allows us to associate known transcription 
factors to our candidate motifs, and helps in constructing longer upstream binding sequences from the motifs 
we have identified. 
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The candidate binding sites: detailed discussion 

The two largest groups of motifs (Tab. 1) are associated to nucleolar proteins implicated in ribosome 
biogenesis, and can be easily recognized as the known RRPE [6] and PAC [7] motifs, whose combinatorial 
effect in regulating the genes involved in ribosome biogenesis has becin recently described in [8]. Ribosomal 
proteins are associated to a different group of motifs (Tab. 2). The sets associated to these motifs have 
significant intersection with the targets of regulation by the transcription factors FHLl and RAPl, and 
the motifs align well with the binding sites for these transcription factors as reported in the TRANSFAC 
database and in Ref. [5]. 

A third large group of motifs displayed a strong overrepresentation of terms related to DNA replication. 
For this group, the consensus sequence perfectly overlaps with one of the putative Mbpl binding sites 
proposed in Ref. [5]. Interestingly, together with the regulatory subunit Swi6, this protein forms the Mlul 
Binding Factor (MBF), a crucial regulator of the Gl/S transition [9]. A few slightly variant motifs and a 
completely different one (CGGGGAGA) are associated with the annotation " micleosome" . 

Besides these large groups, in other four cases the consensus sequence obtained by merging motifs with 
similar annotation perfectly overlapped with the transcription factor binding sites predicted on the basis of 
the chromatin IP experiments, as well as with the experimentally determined sequences as reported in the 
TRANSFAC database. In particular, perfect consensus sequences for GCN4, Stel2, Hsfl and MET4 were 
obtained from motifs annotated amino acid biosynthesis, sexual reproduction, heat shock protein and sulfur 
amino acid biosynthesis, respectively (Tab. 4). 

Interestingly, in some instances a perfect match of the surviving motifs with a known binding site could be 
obtained by analysis of the literature, even though it was not reported in Chromatin IP experiments, as shown 
by the results reported in Tabs. 5 and 6. In particular, the sets characterized by the annotation 'siderophore 
transport' perfectly match the binding site for the Aftlp/2p transcription factors, recently identified as 
critical regulators of iron uptake [10]. The sets characterized by the annotation 'tricarboxylic acid cycle' 
match the known binding site of the UME6, a pleiotropic regulator implicated in glucose repression [11]. 

The sets characterized by the annotation alpha-trealose phosphate synthase complex match the Stress 
Response Element (STRE) [12], a result consistent with the described accumulation of trehalose under stress 
conditions [13]. 

The motif TCGTTTA, associated with the keyword 'steroid metabolism' corresponds fairly well to the 
TCGTATA Sterol Response Element (SRE), recently identified by promoter comparison of the ERG2 and 
ERGS genes [14]. Therefore the genes that are in the corresponding set and are annotated to "steroid 
metabolism" (listed in Tab. 8) can be considered as candidate targets of regulation by UPC2p and ECM22p, 
that is the transcription factors identified in Ref. [14] as binding to the SRE site. Interestingly, ECM22 itself 
belongs to the set corresponding to the motif TCGTTTA, hinting to a feedback mechanism in the regulation 
of steroid metabolism. 

A particularly interesting result is given by the word AAACAAA, associated with the annotation 'spindle', 
and in particular with spindle pole bodies structure and/or function. In this case, no transcription factors 
binding to the respective promoters were identified in Ref. [5]. However the motif perfectly matches the 
Hcmlp binding site, experimentally determined in the promoter of the spindle pole component SPCllO 
(WAAYAAACAAW, Ref. [15]). Accordingly, Hcmlp has been implicated, by a different ChIP analysis [16], 
as one of the potential regulators of spindle pole body genes, even though 8 of the 9 genes (the exception 
is CIN8) identified by our approach and listed in Tab. 9 are not included in the list of the 262 potential 
Hcmlp targets. More precisely Tab. 9 lists the genes that are included in the set corresponding to the motif 
AAACAAA and are annotated to the term "spindle". Our results strongly suggest that these genes are 
potential targets of the Hcml transcription factor. 

Finally the consensus sequence derived from motifs with a significant overrepresentation of 'proteasome' 
perfectly corresponds to the proteasomal element identified in Ref. [17] through a different computational 
approach, to which no transcription factors have been experimentally associated. 
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Functional characterization of the sets corresponding to candidate binding sites 

The statistical significance of the associations we made between motifs and Gene Ontology terms is based 
on the overrepresentation of genes annotated to the term in the set of genes defined by the motif: clearly 
not all genes in the various sets are annotated to the term itself. It is therefore interesting to examine the 
annotation of the genes in the set that are not annotated to the relevant term. To perform this analysis in 
a systematic way, we defined, for each Gene Ontology term, three categories of genes: 

• Genes annotated to the term, i. e. directly annotated to the term itself or to any of its descendants in 
the Gene Ontology graph 

• Genes whose annotation is compatible with the term, defined as genes that are either directly annotated 

to an ancestor of the term, or to "cellular component unknown", etc. or finally genes which do not 
have any annotation in the relevant branch of the Gene Ontology 

• All other genes are considered incompatible with the Gene Ontology term 

For example if the term at hand is "nucleolus" , a gene annotated "nucleus" is considered compatible, while 
a gene annotated "cytoplasm" is incompatible. 

A few examples of the results of this analysis: the set defined by the CCATACA motif includes a total of 
19 genes: 10 of them are annotated "ribosome biogenesis" , 2 are compatible with such annotation and the 
remaining 7 are incompatible. The AAACAAA set includes 55 genes: 9 of them are annotated "spindle" , 
31 are compatible with such annotation and 15 are incompatible. The complete results are available in the 
supplementary material. 

Localization of the candidate binding sites with respect to the Transcription Start Site 

It is interesting to investigate possible correlations between the location of the (candidate) binding sites 
that we find and the Transcription Starting Site (TSS in the following) of the genes. Ideally one would 
expect these binding sites to be preferentially located upstream with respect to the TSS and clustered in the 
first 100-200 bases upstream the TSS. This test however is rather non trivial due to the lack of systematic 
knowledge of TSS's. Among the possible sources of information on the TSS's we decided to use UTRDB [18] 
which contains 5' UTR sequences for 330 yeast genes. We looked at the intersection between these 330 
gene and our sets for the candidate binding sites listed in Tab. 1-6. We selected for a deeper analysis 
the intesection corresponding to the consensus sequence GATGAGATGAGCT. The intersection contains 6 
genes which are shown in Fig. 1 and a total of 18 instances of the candidate binding sites. As it can be seen 
from the figure the majority (66%) of the binding sites are located upstream of the TSS and half of them 
are within 200 bases from the TSS. However this a slightly misleading indicator since some of the UTR's are 
very short in length. If we look instead at the density of binding sites, i.e. at the ratio between the number 
of instances and the length of the observed sequence, we find a slightly higher density in the UTR regions 
(about 1 instance every 100 bases) with respect to the truly upstream regions (1 instance every 150 bases). 
We plan to address this subject in a more complete way in a future work. 

Control analyses 

Recently a major revision of the list of yeast ORFs, and in particular the elimination of many putative 
ORFs, was suggested in Ref. [19] through comparison with closely related organisms. To verify whether such 
changes in the annotation of the genome had an important effect on our results, we repeated our analysis 
without the ~ 500 ORFs that Ref [19] suggests to eliminate; the results however did not change in any 
significant way (data not shown), clearly because most of the ORFs to be eliminated are not annotated to 
Gene Ontology terms. 
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Finally, as a general check of the soundness of this procedure, we repeated the whole analysis on sequences 
which are not assumed to be as relevant to transcriptional regulation as the upstream sequences we have 
studied in depth. More precisely, we selected all yeast intergenic regions that are not upstream of any gene 
and we associated each region to the genes with respect to which the sequence is located downstream. Then 
we repeated the whole analysis in precisely the same way as for the upstream regions: no significant results 
were found with a false discovery rate of 0.01. 

Discussion 

The technique we have proposed in this paper is able to identify many binding sites related to functional 

groups of different size. The binding sites we identified were all previously known from different methods 
(in some cases from computational analyses only). This is explained partly by the fact that yeast is among 
the best-known organisms for what concerns transcriptional regulation, and partly by our very restrictive 
choice of false discovery rate (1%). Indeed the main goal of our analysis was to show that the method, 
and in particular the statistical analysis used to establish the significance of the results, is reliable, and can 
therefore be used on less well known organisms where it is likely to produce new candidate binding sites. 

Besides identifying new binding sites, our method could be extremely valuable in identifying new potential 
targets of known transcription factors. Indeed, in the case of Hcmlp we were able to identify 8 new potential 
target genes that were not identified in chromatin immunoprecipitation experiments. 

An important aspect of our technique is that the gene sets arc based exclusively on the upstream sequence. 
Therefore the same sets can be validated from the biological point of view in many different ways, of which 
the use of microarray results, as in Ref. [2] , of Gene Ontology annotations and of the results of chromatin 
immunoprecipitation experiments (as in the present paper) are just the most natural examples. These 
different validation methods can reveal different binding sites and should be combined whenever possible: for 
example of all the binding sites revealed in the present analysis, only the ones pertaining to the regulation of 
ribosomc biogenesis and the stress response element CCCCT had been previously identified by the microarray 
validation of Ref. [2]. Conversely, the AGCCGCGC binding site of the UME6 transcription factor, correctly 
identified in Ref. [2], was not found in the present one. 

Another possible development of this line of research is the construction of sets of genes labeled by 
spaced dyads instead of simple sequences [20]. These sets could then be validated either based on the 
overreprescntation of Gene Ontology annotations, as in the present work, or based on their coexpression as 
determined in microarray experiments as in Ref. [2] . 

Conclusions 

We have presented a new algorithm for the identification by computational means of transcription factor 
binding sites. The algorithm proceeds by first defining sets of genes having in common the overrepresentation 
of a certain short DNA sequence in their upstream region, and then selecting among all these sets the ones 
showing a statistically significant overrepresentation of genes annotated to specific Gene Ontology terms. 

When applied to yeast the algorithm is able to independently find many known transcription factor 
binding sites. It is therefore likely that our algorithm could be profitably used in the study of transcripional 
regulation in higher eukaryotes and in humans in particular. Here the method needs to be modified, to take 
into account the increased complexity of the problem: the strategy we are currently following is to limit the 
analysis to those upstream regions that arc highly conserved in mouse and human. These conserved regions 
are likely to contain many binding sites, and are short enough to provide a suitable signal to noise ratio. 
The results will be presented in a forthcoming publication. 

While most regulatory binding sites are found upstream of the coding sequence, regulatory regions can 
often be found, for example, in the 3' region or within the first intron: the method we have presented can 
in principle be applied, with suitable modifications, to the investigation of such regions. Moreover, similar 
methods could be devised for the computational identification of other functionally relevant DNA sites. 
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Methods 

Upstream sequences 

For all yeast open reading frames (ORFs) we considered up to 500 bp upstream of the translation start, 
cutting the sequence shorter when necessary to prevent overlapping with the coding region of neighboring 
ORFs. The sequences were obtained from the Regulatory Sequence Analysis Tool [21]. 



Purging repeated upstream sequences 

Genes with highly similar upstream regions are a likely source of false positives, especially when similarly 
annotated. Consider for example the ASPS gene, which appears in four copies in the genome, with nearly 
identical upstream regions, and identical Gene Ontology annotations. Clearly these four upstream regions 
must be counted as one to avoid introducing a bias that is likely to generate false positives. To avoid these 
problems, we first used BLAST [22] to list all the pairs of nearly identical upstream sequences (Blast P-value 
less than 10~^"). Then we used these results to form clusters of nearly identical upstream regions (that 
is the connected components of the graph constructed by joining with an edge all pairs of nearly identical 
upstream sequences). Finally, we retained for further analysis only one gene per group, chosen at random. 
After this procedure we were left with 6,037 upstream sequences which we retained for further analysis. 



Motifs analyzed 

We considered all possible DNA sequences 5 to 8 base pairs long. Occurrences of each motif were counted on 
both strands: therefore, for example, the motifs CCCCT and AGGGG are considered as the same motif, and 
the number of occurrences of this motif is the sum of the number of occurrences of CCCCT and AGGGG 
on one strand. For palindrome motifs (such as AGGCCT) the number of occurrences is simply the number 
of occurrences on one strand. 



Bacl<ground frequencies 

The determination of which motifs are overrepresented in each upstream region needs the definition of 
a background frequency to which the number of actual occurrences is compared. We chose to use the 
frequency of each motif in the set of all the upstream sequences considered in the study, taken as a single 
sample. Denote by U{g) the length of the upstream region considered for a gene g (500 bp or less in this 
study, see above), and by n(rn,g) the number of occurrences of the motif m in such region. If b(rn) is the 
length of m, the background frequency /(m) of m is defined as 

. Egnim,g) 

f{m) = 1 

where both sums are taken over all the yeast ORFs. In the denominator 

u{g, to) = U (g) — b{m) + 1 
is the number of words of length b(rn) that can be read in the upstream region of g. 



Overrepresented motifs 

The null hypothesis we use to determine whether a motif is overrepresented in the upstream region of a gene 

is the same as in Rcfs. [1] and [2]: motifs arc distributed randomly, each with its own background frequency, 
in the upstream regions. The probability of having n(m, g) or more occurrences of m in the upstream region 
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of g depends on the background frequency of m and the length U{g) of the upstream region of g, and is 
given by the right tail of the binomial distribution: 

P{n{m,g)J{m)M9))= E f "^^^^M I/Ml' [1 " /M]"^''™^"' (2) 

k=n{m,g) 

Notice that the binomial distribution is based on the assumption that the motifs read successively in the 
upstream sequence arc independent of each other; while this is obviously not the case, the motifs we study 
(5 base pairs or longer) are sufficiently rare that this aspect can be safely neglected. 

The motif m is considered overrepresented in the upstream region of g if the probability 

Pinim,g),fim.),U{g)) 

is lower than a certain cutoff. In this study we fixed the cutoff at 0.01. It is important to keep in mind 
that no biological significance is attributed to this overrepresentation by itself: only the motifs that pass the 
functional characterization test described below will be retained in the final results. Therefore the choice 
of this cutoff can be arbitrarily lenient. While it would seem natural to make the cutoff more strict as the 
word length is increased, to take into account the larger number of words analysed, our experience, also 
gained with the microarray analysis of reference [2] shows that many significant results would be lost in this 
way. The result of this analysis is, for each motif m, a set S{m) of genes in whose upstream region m is 
overrepresented. 



Functional characterization of the sets S{m) 

The final step is the analysis of the annotation of the genes included in each set S{m), to select the sets 
characterized by a strong functional characterization and hence the candidate binding sites. For each set 
S{m) we computed the prevalence of all Gene Ontology (GO) terms among the annotated genes in the set, 
and the probability that such prevalence would occur in a randomly chosen set of genes of the same size. We 
always consider a gene annotated to a GO term if it is directly annotated to it or to any of its descendants in 
the GO graph. For a given GO term t let K{t) be the total number of ORFs annotated to it in the genome, 
and k(m,t) the number of ORFs annotated to it in the set S{m). If J and j{m) denote the number of 
ORFs in the genome and in S{m) respectively, such probability is given by the right tail of the appropriate 
hypergeometric distribution: 

mini j(m),K{t)) 

P{J,K{t),j{m),k{m,t))= E F{J,K{t),j{m),h) (3) 

h=k{m,t) 

where 



F{M, m, N, n) = 




In this way a P-value can be associated to each pair made of a motif and a Gene Ontology term. A low 
P-value indicates that the overrepresentation of the motif is correlated to the functional characterization 
described by the GO term, and hence suggests the motif as a candidate binding site for the genes with such 
functional characterization. 



Determining the cutoff on P-values 

Given the huge number of P-values that we compute (in principle equal to the number of GO terms multiplied 
by the number of motifs analysed) it is clear that very low P-values could appear simply by chance. Therefore 



8 



a careful analysis is required to keep under control the number of false positives. The usual way of dealing 
with this issue, that is the Bonferroni correction, is not appropriate, because due to the hierarchical nature 
of the Gene Ontology annotation scheme, the P-values we compute are very far from being independent from 
each other (for example, for a given motif, the P-values associated to terms such as 'cell cycle and DNA 
replication' and 'chromosome cycle' are obviously strongly correlated to each other). We decided therefore 
to adopt an empirical approach allowing us to fix the false discovery rate without any prior assumption 
regarding the distribution of the P-values. 

First, we randomly generated a large number Nr of sets of genes comparable in size to the typical size 
of the sets associated to the motifs. For each of these sets we computed the P-values associated to all the 
GO terms with the same formula used for the "true" sets. For each random set we considered only the best 
P-value obtained for each of the three branches of the Gene Ontology, and we ranked the random sets based 
on this best P-value. In this way we can determine a false discovery probability Pf{C) as a function of the 
cutoff on P-values C, defined as 

«(^) = =f^ (^) 

where nf{C) is the number of random sets whose best P-value is less than C. 

Equipped with the function pf(C), wc can estimate the ratio of false discoveries among our true sets, as 
a function of C: if is the total number of true sets examined, the expected number of false discoveries 
with cutoff C is Npf{C). Therefore if n(C) is the number of true sets with best P-value less than C we can 
estimate the false discovery rate (FDR) to be 

NpfjC) 
n{C) • 

In this way wc can choose C based on the desired false discovery rate. Clearly the lower is the FDR required, 
the higher is the precision required in determining the function Pf{C) for low values of C, and hence the 
number Nr of sets to be generated randomly. In our case we set the false discovery rate at 0.01: a reliable 
estimate of the corresponding cutoffs in P- value required the generation of 3.5 million randomly chosen sets 
of 20 genes (which is the typical size for the true sets corresponding to overrepresented motifs 5 to 8 bp 
long). Simulations performed with other choices of the set size did not difii'er enough to change significantly 
our estimate of the false discovery rate. 

Comparison with experimentaly determined associations between transcription factors and regulated 

genes 

We performed a systematic comparison between the sets of genes identified as significantly associated to 
one or more GO terms and the sets of genes that were determined in Rcf. [5] to be capable of interacting 
with each of 106 known transcription factors. For each of these transcription factors we defined a set of 
regulated genes (called "TF set" in the following) by considering all the genes with P-value less than 0.001 
for interaction with the TF according to the supplementary material of Ref. [5] . Then for each of the sets 
identified as significant by our method we considered its intersection with all the TF sets, and computed a 
related P-value: the probability that a randomly chosen set of genes has an intersection equal to or greater 
(in number of elements) than the one actually found (given by the same hypcrgcomctric distribution used 
to determine the overrepresentation of GO annotations). We considered the result statistically significant 
when this intersection P-value was less than 10~^. 
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Figures 
Figure 1 legend 

Location of the motifs belonging to the consensus GATGAGATGAGCT with respect to the translation and tran- 
scription start sites for 6 genes for which the latter is known. The binding sites are denoted by rectangles above or 
below the line depending on whether the consensus sequence is read on the Crick or Watson strand respectively. The 
vertical bar is the transcription start site, as given in Ref.[20]. 

Additional Files 

Additional file 1 — overrepresentation.txt 

For all sets found significant by the algorithm this table shows the background frequency of the motif, and the number 
of instances required for the motif to be overrepresented in a 500 bp long upstream sequence. For shorter upstream 
sequences less repetitions are needed in general. 

Additional file 2 — occurrences.txt 

The number of occurrences of the motif ACGCG (associated to "replication fork" and "DNA metabolism" ) in all the 
genes belonging to its set, and the length of the corresponding upstream region. 

Additional file 3 — annotation.txt 

For all sets found significant by the algorithm the table reports the corresponding Gene Ontology term and the 
number of genes annotated to the term, compatible and incompatible with the term (see text for the definition of 
"compatible" and "incompatible"). 
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Tables 
Table 1 

Significant motifs associated to nucleolar proteins implicated in ribosome biogenesis. The columns titled "C" , "F" and 
"P" correspond to the three branches of the Gene Ontology: cellular component, molecular function and biological 
process respectively. 



motif 


c 


F 


P 


GATGAGA 


miclcolus 




ribosome biogenesis 


GATGAGAT 


nucleolus 




ribosome biogenesis 


ATGAGAT 


nucleolus 




ribosome biogenesis 


ATGAGATG 






ribosome biogenesis 


TGAGATG 






ribosome biogenesis 
and assembly 


TGAGATGA 






ribosome biogenesis 

and assembly 


GAGATG 






ribosome biogenesis 
and assembly 


GAGATGAG 


nucleolus 




ribosome biogenesis 
and assembly 


GAGATGA 


nucleolus 




ribosome biogenesis 

and assembly 


AGATGAG 


nucleolus 




ribosome biogenesis 


GATGAG 


nucleolus 




ribosome biogenesis 


GATGA 






ribosome biogenesis 


ATGAGCT 


nucleolus 




ribosome biogenesis 


TGAGCT 


nucleolus 




rRNA processing 


GATGAGATGAGC T 




AAAAATT 


nucleolus 




ribosome biogenesis 


AAAAATTT 


nucleolus 

complex 




transcription 
from Pol I promoter 


AAAATT 


nucleolus 




ribosome biogenesis 


AAAATTT 


nucleolus 




ribosome biogenesis 


AAAATTTT 


nucleolus 




ribosome biogenesis 


AAATT 


nucleolus 




35S primary 
transcript processing 


AAATTTTC 


small nucleolar 
ribomiclcoprotcin 
complex 




35S primary 
transcript processing 


AAAAATTTTC 
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Table 2 

Significant motifs associated to ribosomal proteins. Here and in the following tables the column "TF" reports 
the transcription factors studied in Ref. [5] for which the intersection between the experimentally found targets of 
regulation and our sets is statistically significant. 



motif 


C 


F 


P 


TF 


A CCC ATA 


ribosome 
(sensu Eukarya) 


Oil uciuiai 
constituent 
of ribosome 




RAPl 


A CCC A TCiC 


r* T r "1" (~i G 1 1 r* 

ribosome 
(sensu Eukarya) 






PHT.I 
RAPl 


A CCC CiT AC 


ribosome 
(sensu Eukarya) 






RAPl 


n PHP P T A r 


Cy lUbUilC 

ribosome 
(sensu Eukarya) 






■PTTT 1 

J? LLljl. 

RAPl 


c cac c, T 


1 Si Toro 

ribosomal 
subunit 






PHT.I 


ccc n T A r A 

yy \y Kj VJT J- -Tl. \y iT. 


i^y ttJoUill^ 

ribosome 

( QPTlQll P,n Irfl FATfl 1 
1 OCiiO LI J_J LilVCli y Ci I 






FHT.I 
RAPl 


CC ATACAT 


cytosolic 
ribosome 


structural 
constituent 

C\T T'l hf^lQriTTl f 
KJL L 1 UUoUiiiL. 


- 


FHLl 


CC ATACA 


cytosolic 
ribosome 
(sensu Eukarya) 


structural 
constituent 
of ribosome 


protein 
biosynthesis 


FHLl 
RAPl 


GCCTAG 


cytosolic 
ribosome 
(sensu Eukarya) 






FHLl 


GCCTAGAC 


cytosolic 
ribosome 

(sensu Eukarya) 






FHLl 
RAPl 


GCCCA 


cytosol 






FHLl 
RAPl 


CATACAT 


cytosolic 
ribosome 
(sensu Eukarya) 






FHLl 


MCSCMTACAY 
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Table 3 

Significant motifs associated to DNA replication and nucleosome. 



motif 


C 


F 


P 


TF 


AAACGCG 






DNA replication 


MBPl 

b WlD 


AG ACGCGT 


- 


- 


DNA dependent 
DNA replication 




GACGCGT A 


- 


- 


DNA replication and 
chromosome cycle 


MBPl 
SWI6 


A A GGGG 






DNA replication 


MBPl 

bWlD 


G ACGCG 






mitotic sister 
chromatid 

cohesion 


MBPl 
SWI4 

SWI6 


ACGCG 


replication fork 




DNA metabolism 


MBPl 
SWI4 

SWI6 


ACGCGT 






DNA rc^plication 


MBPl 
SWK) 


ACGCGT C G 






mitotic sister 
chromatid 
cohesion 


MBPl 
SWI6 


CGCGT A 






DNA metabolism 


MBPl 


ARACGCGTMG 


AACCGCGT 


nucleosome 






HIRl 
HIR2 


A ATGCGA 


nucleosome 






HIRl 
HIR2 


ATGCGA A 


nucleosome 




chromatin 
assembly/disassembly 


HIRl 
HIR2 


CGCGA 


nucleosome 






MBPl 

SWI4 
SWI6 


GCGC T CG 


nucleosome 








AAMYGCGAWCG 


CGGGGAGA 


nucleosome 
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Table 4 

Other motifs with significant intersection with ChIP data. 



motif 


C 


F 


P 


TF 


(j A i tjrA(j i Kj 






amino acid 

metabolism 


GGiN4 


A i (j A O i 






non-protein 
amino acid 

metabolism 


GG1N4 


GTGAGTCA 






amino acid 
metabolism 


GCN4 


ATGAGTCA 






amino acid 
metabolism 


GCN4 


TGAGTCAC 






amino acid 

biosynthesis 


GCN4 


AGAGTCAT 






amino acid 
m(>t,al)olisiii 




GAGTCA 






amino acid 
biosynthesis 


BASl 
GCN4 


GATGAGTCAY 


TGAAAC 






sexual reproduction 


DIGl 
STE12 


TGAAACA 






sexual reproduction 


DIGl 
STE12 


TGAAACA 


ACTGTG 






sulfur amino 
acid transport 


MET4 


TGTGGC 






sulfur metabolism 


MET4 
MET31 


ACTGTGGC 


TGTAGAA 




heat shock protein 


protein folding 


HSFl 
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Table 5 

Motifs associated to siderophore transport and tricarboxylic acid cycle. 



motif 




T7 

r 


T3 

r 


AGGGTGC 






siderophore 

transport 


AGGGTGCA 






siderophore 
transport 


TGGGTGCA 






siderophore 
transport 


GGGTGCA 






siderophore 

transport 


GGGTGC 






sid(Tophore 
transi)()rt 


GGTGCA 




heavy metal 
ion porter 


siderophore 
transport 


GGTGC 


cell wall 
(sensu Fungi) 






AGGGTGCACC 


CGGCGCC 






tricarboxylic 

acid cycle 


CGGCGCCG 






tricarboxylic 
acid cycle 


GGCGCCGA 






tricarboxylic 
acid cycle 


GCGCCGAG 






tricarboxylic 
acid cycle 


CGGCGCCGAG 
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Table 6 

Other significant motifs associated to transcription factors not studied in Ref. [5]. 



motif 


C 


F 


P 


ACCCC 


alpha, alpha-trehalose- 

phosphate synthase 
complex (UDP-forming) 






CCCCT 


alpha, alpha-trehalosc- 

phosphate synthase 
complex (UDP-forming) 






ACCCCT 


CCGGTGGC 


26S proteasome 


threonine 
(^ndopc^ptidasc 




AGGTGGCA 


2()S i)rotcHh(jiu(> 






CGGTGGCA 


26S proteasome 


proteasome 
endopeptidase 




GGGTGGCA 


26S proteasome 


proteasome 
endopeptidase 




CCGGTGGCA 


AAACAAA 


spindle 






TAAACGA 






steroid metabolism 



17 



Table 7 

Genome prevalence of the consensus sequences constructed by aligning the motifs found significant by the algorithm, 
and functional characterization of the corresponding genes. The second column contains the total number of genes in 

the yeast genome in which the consensus sequence is found at least once in the upstream region. The third column 
reports one of the annotation terms to which the consensus sequence is associated, and the number of genes annotated 
to the term among the ones containing the consensus sequence. 



Consensus sequence 


Total genes 


Annotated genes 


GATGAGATAGCT 


17 


nucleolus (8) 


AAAAATTTTC 


233 


nucleolus (37) 


MCSCMTACAY 


78 


cytosolic ribosomc (20) 


ARACGCGTMG 


15 


DNA replication and chromosome cycle (2) 


AAMYGCGAWCG 


6 


nucleosome (0) 


GATGAGTCAY 


5 


amino-acid metabolism (3) 


TGAAACA 


560 


sexual reproduction (26) 


AGGGTGCA 


40 


siderophore transport (5) 


CGGCGCCGAG 


5 


tricarboxylic acid cycle (3) 


ACCCCT 


434 


alpha, alpha-trehalose phosphate 






synthase complex (UDP-forming) (2) 


CCGGTGGC 


17 


26S proteasome (6) 



Table 8 

Candidate targets of regulation by the UPC2p and ECM22p transcription factors. 



ATF2 


(YGR177C) 


ECM22 


(YLR228C) 


ERG3 


(YLR056W) 


ERG4 


(YGL012W) 


ERG5 


(YMR015C) 


ERG7 


(YHR072W) 


ERG26 


(YGLOOIC) 
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Figure 1: Location of the motifs belonging to the consensus GATGAGATGAGCT with respect to the 
translation and transcription start sites for 6 genes for which the latter is known. The binding sites are 
denoted by rectangles above or below the line depending on whether the consensus sequence is read on the 
Crick or Watson strand respectively. The vertical bar is the transcription start site, as given in Ref. [20]. 



Table 9 

Candidate targets of regulation by the Hcmlp transcription factor. 



MPSl 


(YDL028C) 


CIN8 


(YEL061C) 


PDSl 


(YDR113C) 


SPC98 


(YNL126W) 


VIKl 


(YPL253C) 


SPC25 


(YER018C) 


ESPl 


(YGR098C) 


STU2 


(YLR045C) 


SLI15 


(YBR156C) 



ANP1 



TRK2 



VMA10 



RNT1 



PMI40 



RPA190 
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